Introduction to Data Wrangling -- Reading

The definition points out that:
	(1) the ultimate goal of data wrangling is to make data useful, and
	(2) the output of data wrangling consists of both a clean dataset and an editable, auditable
transcript of the data manipulations performed. The transcript should also be reusable and
potentially adaptable to other, similar datasets.

In conclusion, real-world data is usually incomplete, dirty and inconsistent, and there is often a
great deal of it. Therefore, data wrangling techniques, in particular automated techniques, are
needed to improve the accuracy and efficiency of downstream data analysis tools.
	"Current data quality problems cost U.S. business more than 600 billion dollars a year."
	"Between 30% to 80% of the data analysis task is spent on cleaning and understanding the data."

	Data cleaning: A set of operations that impute missing values, resolve inconsistencies,
identify/remove outliers, unify data formats and so on (a minimal pandas sketch follows this list).
	Data integration: A process of merging data from different sources into a coherent store. It
often comprises resolving duplicated records, detecting conflicts in data values, finding redundant
attribute values, schema matching, etc. (also sketched after this list).
	Data normalisation and aggregation: Data normalisation adjusts attribute values measured on
different scales to a common scale, and data aggregation combines data from several measurements
(sketched after this list).
	Data reduction: Produces a smaller representation of the original data via diverse techniques,
e.g., feature selection, dimension reduction, instance sampling (sketched after this list).
	Data discretization: Transforming numerical attributes into nominal attributes (part of data
reduction but with particular importance in data preprocessing) is called data discretization.
Various techniques can be used to perform data discretization, such as binning methods, histogram
analysis, clustering analysis, segmentation by natural partitioning, etc. (sketched after this list).
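
A minimal pandas sketch of the cleaning operations above. The toy DataFrame and its column names
("age", "country") are made up for illustration and are not taken from the reading:

import pandas as pd

# Toy data with a missing value, an implausible outlier and inconsistent labels
df = pd.DataFrame({
    "age":     [23, None, 31, 200, 28, 35, 27],
    "country": ["UK", "uk", "U.K.", "US", "us", "UK", "US"],
})

# Impute the missing age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Identify/remove outliers with a simple interquartile-range rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Unify data formats: normalise the country labels to one spelling
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)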
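
A minimal sketch of data integration with pandas, assuming two hypothetical sources that describe
the same customers under slightly different schemas:

import pandas as pd

customers_a = pd.DataFrame({"cust_id": [1, 2, 2], "email": ["a@x.com", "b@x.com", "b@x.com"]})
customers_b = pd.DataFrame({"id": [1, 2, 3], "city": ["Leeds", "York", "Bath"]})

# Resolve duplicated records within one source
customers_a = customers_a.drop_duplicates()

# A very simple form of schema matching: map differing column names onto one schema
customers_b = customers_b.rename(columns={"id": "cust_id"})

# Merge the two sources into a single coherent table
merged = customers_a.merge(customers_b, on="cust_id", how="outer")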
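
A sketch of normalisation and aggregation with pandas; the sensor readings are invented to show
measurements on very different scales:

import pandas as pd

df = pd.DataFrame({
    "sensor":  ["s1", "s1", "s2", "s2"],
    "reading": [10.0, 14.0, 980.0, 1020.0],
})

# Normalisation: rescale each sensor's readings to the common [0, 1] range (min-max scaling)
def min_max(s):
    return (s - s.min()) / (s.max() - s.min())

df["reading_scaled"] = df.groupby("sensor")["reading"].transform(min_max)

# Aggregation: combine several measurements into one value per sensor
summary = df.groupby("sensor")["reading"].agg(["mean", "count"])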
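
A sketch of two simple data reduction techniques, using randomly generated columns as stand-ins
for real features:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "constant": np.ones(1000),      # a feature that carries no information
})

# Feature selection (a crude form): drop near-constant columns
df = df.loc[:, df.std() > 1e-6]

# Instance sampling: keep a 10% random sample of the rows
sample = df.sample(frac=0.1, random_state=42)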
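
A sketch of data discretization by binning with pandas; the age values and bin boundaries are
chosen purely for illustration:

import pandas as pd

ages = pd.Series([5, 17, 23, 35, 41, 58, 72, 89])

# Equal-width binning into three intervals
equal_width = pd.cut(ages, bins=3)

# Binning with hand-chosen boundaries and nominal labels
labelled = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Equal-frequency (quantile) binning, closer in spirit to histogram analysis
equal_freq = pd.qcut(ages, q=4)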

1. Data acquisition: Gather data from different sources, e.g., the web, sensors, and conventional
databases, via API requests (e.g., Twitter's API and Google API), web scraping (acquiring data from
the Internet in many ways other than API access), etc. Tools used include various Python packages,
pandas, R, etc. (a short sketch appears after this list).
2. Data loading & extracting: Load and parse data stored in many different formats, like XML,
JSON, CSV, natural language text, etc. Tools used include, for instance, Beautiful Soup
(http://www.crummy.com/software/BeautifulSoup/) (one of many Python packages for parsing XML/HTML),
regular expressions, and NLTK (http://www.nltk.org/) (a Python package for natural language
processing); see the sketch after this list.
3. Data cleaning: Diagnose and handle various data quality problems. As mentioned above, data
cleaning requires a set of operations that impute missing values, resolve inconsistencies,
identify/remove outliers, unify data formats and address the other problems discussed in "Why do We
Wrangle Data?".
4. Data integration: Merge data from different sources to create a rich and complete data set. It
involves a set of operations that resolve related issues, such as data duplication, entity
matching, and schema matching.
5. Data profiling: Utilises different kinds of descriptive statistics and visualisation tools to
improve data quality. The data profiling process might uncover more data quality problems and
suggest further operations for data cleaning and data integration (sketched after this list).
6. Data enrichment: Enrich existing data by feature generation, data transformation, data
aggregation, data reduction, etc. After performing data cleaning and profiling, we now have a good
sense of the data. We should then think about what new kinds of data we can derive from the data we
already have, or merge in from other related sources (sketched after this list).
7. Data storing: Finally, store the clean data in various formats that are easily accessible by
downstream analysis tools (sketched after this list).
8. Documenting the process: Besides a cleaning script, we should also keep a detailed description
of all data manipulations applied in the above tasks and generate a proper code book that describes
each variable and its values in the clean data (a small sketch of such a code book follows).
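
A sketch of data acquisition via an API request; the endpoint and parameters are hypothetical, and
real services such as Twitter's API additionally require authentication and paging:

import requests
import pandas as pd

url = "https://api.example.com/v1/measurements"   # hypothetical endpoint
response = requests.get(url, params={"since": "2021-01-01"}, timeout=30)
response.raise_for_status()

# Many APIs return JSON that maps naturally onto a DataFrame
df = pd.DataFrame(response.json())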
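
A sketch of loading and extracting data from a few common formats; the HTML fragment, the free
text and the CSV content are invented for the example:

import io
import json
import re
import pandas as pd
from bs4 import BeautifulSoup

# Parse a small HTML fragment with Beautiful Soup and pull out the link texts
html = "<ul><li><a href='/a'>Alpha</a></li><li><a href='/b'>Beta</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")
names = [a.get_text() for a in soup.find_all("a")]

# Extract dates from free text with a regular expression
text = "Collected on 2021-03-01 and revised on 2021-04-15."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Load structured formats (CSV, JSON) directly
df_csv = pd.read_csv(io.StringIO("id,value\n1,3.2\n2,4.8"))
df_json = pd.DataFrame(json.loads('[{"id": 1}, {"id": 2}]'))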
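
A sketch of basic data profiling with pandas; the toy DataFrame reuses the invented "age" and
"country" columns, and the final histogram call assumes matplotlib is installed:

import pandas as pd

df = pd.DataFrame({
    "age":     [23, None, 31, 200, 28],
    "country": ["UK", "UK", "US", "US", None],
})

# Descriptive statistics for numerical columns
print(df.describe())

# How many missing values does each column have?
print(df.isna().sum())

# Distribution of a categorical attribute; unexpected values often signal quality problems
print(df["country"].value_counts(dropna=False))

# A quick histogram often reveals outliers such as the age of 200
df["age"].hist()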
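
A sketch of data enrichment by feature generation and aggregation; the orders table and the derived
features are invented for illustration:

import pandas as pd

orders = pd.DataFrame({
    "customer":   ["ann", "ann", "bob"],
    "order_date": pd.to_datetime(["2021-01-03", "2021-02-10", "2021-01-20"]),
    "amount":     [20.0, 35.0, 50.0],
})

# Feature generation: derive new attributes from existing ones
orders["order_month"] = orders["order_date"].dt.month
orders["is_large"] = orders["amount"] > 30

# Aggregation: one enriched row per customer
per_customer = orders.groupby("customer").agg(
    total_spent=("amount", "sum"),
    n_orders=("amount", "count"),
)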
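
A sketch of storing the clean data in a few accessible formats; the file names and the SQLite
table name are arbitrary choices for the example:

import sqlite3
import pandas as pd

clean = pd.DataFrame({"id": [1, 2], "age": [23, 31]})

# Plain text formats are the most portable
clean.to_csv("clean_data.csv", index=False)
clean.to_json("clean_data.json", orient="records")

# A relational store that downstream SQL-based tools can query directly
with sqlite3.connect("clean_data.db") as conn:
    clean.to_sql("measurements", conn, if_exists="replace", index=False)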
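
A small sketch of generating a code book from the clean data; the variable descriptions are
placeholders that would normally be written by hand:

import pandas as pd

clean = pd.DataFrame({"id": [1, 2, 3], "age": [23, 31, 28], "country": ["UK", "US", "UK"]})

descriptions = {"id": "unique record identifier",
                "age": "age in years",
                "country": "country label"}

# One row per variable: its type, a few example values and a human-written description
codebook = pd.DataFrame({
    "variable": clean.columns,
    "dtype": [str(t) for t in clean.dtypes],
    "example_values": [clean[c].unique()[:3].tolist() for c in clean.columns],
    "description": [descriptions[c] for c in clean.columns],
})
codebook.to_csv("codebook.csv", index=False)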